
[Closed] Record: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache)#962

Closed
AnirudhRahul wants to merge 2 commits into openai:main from AnirudhRahul:record/low-eval-memory-no-phrase-00214

Conversation

AnirudhRahul commented Mar 27, 2026

Summary

  • Supersedes #931 (Record: 0.0498 bpb - Packed Training N-gram Artifact + Learned Weighting Gate, updated) for this line of work with the final low eval-time memory regime: the packed order-2..9 training n-gram artifact and the learned gate remain, but the logistic context mixer and the long phrase cache are removed from the final eval path.
  • Final 3-seed mean val_bpb is 0.02139943 +/- 0.00003918; worst-case total submission size is 15,881,331 bytes and worst-case eval time is 437s.
  • All reported runs stay within budget: training <600s, eval <600s, artifact <16MB.
  • The main auxiliary eval-time state is the fixed 2 MiB order-2..9 n-gram cache: 32K buckets with two uint32 count tables per order. This is the primary persisted state beyond the transformer itself and it does not grow with validation length.
  • The submission keeps the compliant causal path: the n-gram cache persisted from training time is included as part of the artifact itself, expert availability is context-only, GPTQ calibration uses cached training batches, the output distribution is normalized to sum to 1 for each token, and the reported path uses TTT_EPOCHS=0.
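The 2 MiB figure for the auxiliary cache follows directly from the stated layout. A quick back-of-the-envelope check, assuming the layout described above (8 orders from 2 to 9, 32,768 hash buckets, two uint32 count tables per order):

```python
# Size check for the packed n-gram cache layout described above.
# Assumes: orders 2..9 (8 orders), 32,768 hash buckets per order,
# two uint32 (4-byte) count tables per order.
NUM_ORDERS = 9 - 2 + 1      # orders 2..9
BUCKETS = 32_768            # matches NGRAM_EVAL_BUCKETS in the repro command
TABLES_PER_ORDER = 2        # two uint32 count tables per order
BYTES_PER_ENTRY = 4         # uint32

total_bytes = NUM_ORDERS * BUCKETS * TABLES_PER_ORDER * BYTES_PER_ENTRY
print(total_bytes, total_bytes / 2**20)  # 2097152 bytes = exactly 2.0 MiB
```

Because the cache is a fixed array of hash buckets, this footprint is constant regardless of how many validation tokens are streamed through it.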

Results

| Seed | Final val_bpb | Artifact bytes | Total bytes | Eval time |
| ---- | ------------- | -------------- | ----------- | --------- |
| 1337 | 0.02144330 | 15,015,946 | 15,179,538 | 432s |
| 42 | 0.02136791 | 15,717,739 | 15,881,331 | 433s |
| 7 | 0.02138708 | 15,083,362 | 15,246,954 | 437s |

3-seed mean val_bpb: 0.02139943

Sample std: 0.00003918
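The reported mean and spread can be recomputed from the per-seed values in the table above (the quoted figure is the sample standard deviation, i.e. with an n-1 denominator):

```python
# Recompute the 3-seed statistics from the per-seed val_bpb values above.
import statistics

val_bpb = [0.02144330, 0.02136791, 0.02138708]  # seeds 1337, 42, 7
mean = statistics.mean(val_bpb)
std = statistics.stdev(val_bpb)  # sample std (n-1 denominator)
print(f"{mean:.8f} +/- {std:.8f}")  # 0.02139943 +/- 0.00003918
```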

Causal Inference Scheme

  1. Deserialize the packed order-2..9 n-gram cache from the submitted artifact at eval start.
  2. Score each validation chunk once using only left context and the current cache state.
  3. Query n-gram experts using left context only; the learned gate's expert-availability mask depends only on context evidence.
  4. Blend neural + n-gram experts, then renormalize the full-vocabulary distribution so it sums to 1 before scoring.
  5. Update the streaming n-gram cache only after the chunk has already been scored.
  6. Report the final single-pass path with TTT_EPOCHS=0.
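The score-then-update ordering in steps 2 and 5 is the core of the causality argument. A minimal sketch of that loop, where `score_chunk`, `update_cache`, and the chunk iterator are hypothetical stand-ins for the submission's actual functions, shown only to make the ordering explicit:

```python
# Sketch of the single-pass eval loop: every chunk is scored with the cache
# state that existed *before* that chunk, and the streaming cache is folded
# forward only afterwards. Function names here are illustrative stand-ins.

def evaluate_single_pass(chunks, cache, score_chunk, update_cache):
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        # 1) Score using only left context and the current cache state.
        bits, n_tokens = score_chunk(chunk, cache)
        total_bits += bits
        total_tokens += n_tokens
        # 2) Only now fold the already-scored chunk into the cache.
        update_cache(cache, chunk)
    return total_bits / total_tokens  # per-token bits, as defined upstream
```

At eval step 0 the cache is not empty: it holds the warm-start state deserialized from the artifact, which was built from training data only.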

Compliance

  • This is not a 2-pass method.
  • Validation is scored in a single causal pass: each chunk is scored before that chunk is used for any cache update.
  • The warm-start cache used at eval step 0 is part of the artifact itself, not a separate runtime input.
  • The n-gram cache persisted from training time is included as part of the artifact and deserialized at eval start.
  • The packed n-gram cache in the artifact is derived from training data only and is produced within the 600 second training budget.
  • The learned gate does not use the true next token to decide which experts are available.
  • GPTQ calibration runs inside the reserved pre-export budget using cached training batches from the same timed run; it does not reopen training shards after the official wallclock limit.
  • The output distribution is normalized to sum to 1 for each token before likelihood is accumulated.
  • The current reported numbers use TTT_EPOCHS=0, so there is no backward test-time adaptation in the final submission path.
  • No future validation tokens are visible when scoring the current chunk.
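The normalization bullet corresponds to step 4 of the inference scheme (and to `RENORMALIZE_FINAL_PROBS=1` in the reproduction command). A minimal sketch of a blend-then-renormalize step; the gate weight and the toy distributions are illustrative stand-ins, and the real path operates on full-vocabulary tensors per token position:

```python
# Sketch of blending a neural distribution with an n-gram expert distribution
# and renormalizing so the result sums to 1 before likelihood accumulation.
# gate_weight in [0, 1] stands in for the learned gate's output.

def blend_and_renormalize(neural_probs, ngram_probs, gate_weight):
    """Convex per-token blend of two distributions, renormalized to sum to 1."""
    mixed = [gate_weight * q + (1.0 - gate_weight) * p
             for p, q in zip(neural_probs, ngram_probs)]
    z = sum(mixed)  # guards against drift from clipped or quantized experts
    return [m / z for m in mixed]
```

Renormalizing after the blend guarantees the accumulated likelihood is computed over a proper probability distribution, which is the property `VERIFY_FINAL_PROBS=1` asserts at runtime.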

Reproduction

pip install -r records/track_10min_16mb/2026-03-27_LowEvalMemoryRegime_PackedTrainCache_NoMixer/requirements.txt

cd records/track_10min_16mb/2026-03-27_LowEvalMemoryRegime_PackedTrainCache_NoMixer

SEED=1337 \
DATA_PATH=/root/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/root/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
ARTIFACT_NGRAM_EXPORT=1 \
MAX_WALLCLOCK_SECONDS=600 \
VAL_LOSS_EVERY=0 \
USE_MIXER=0 USE_PHRASE_CACHE=0 MIXER_HEAD=multi \
USE_NGRAM_CACHE=1 NGRAM_EVAL_ORDER=9 \
TRAIN_ORACLE_BUCKETS=32768 NGRAM_EVAL_BUCKETS=32768 \
USE_REGIME_TRACKER=0 USE_LOGIT_CAL=1 \
TTT_EPOCHS=0 TTT_FREEZE_BLOCKS=2 TTT_LR=0.0001 \
TTT_CHUNK_TOKENS=131072 SKIP_SLIDING=1 EVAL_STRIDE=64 TTT_TEMPERATURE=0.85 \
CROWN_Q_LAMBDA=0.01 PRUNE_PCT=0.05 BIGRAM_VOCAB_SIZE=0 \
GPTQ_CALIBRATION_SEQS=128 \
RENORMALIZE_FINAL_PROBS=1 VERIFY_FINAL_PROBS=1 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Submission Checklist

  • One new folder added under records/track_10min_16mb
  • README.md included
  • submission.json included
  • train_gpt.py included
  • Train logs included (train_seed1337.log, train_seed42.log, train_seed7.log)
  • Train and eval under 10 minutes
  • Artifact under 16MB
  • No tokenizer/dataset edits
  • Score-first ordering preserved (no hindsight path)

This updates the packed training n-gram artifact submission with the final no-mixer, no-phrase 3-seed reruns and documents the causal single-pass evaluation path.

Made-with: Cursor
AnirudhRahul (Author) commented:

This submission attempts to address the concerns about memory usage from n-gram caches (flagged as potentially problematic in #886): it uses only an extra 2 MiB of memory to track its n-gram state.

This replaces the prior point-scored results with renormalized 3-seed reruns so the final output distribution sums to 1 at every token and the published BPB reflects the normalized path.

Made-with: Cursor
@AnirudhRahul AnirudhRahul changed the title Record: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache) [Closed] Record: 0.0214 bpb - Low Eval-Time Memory Regime: Packed Training N-gram Artifact + Learned Gate (No Phrase Cache) Mar 27, 2026
AnirudhRahul (Author) commented:

#677 (comment)
